
feat: add bmad-eval-runner skill with isolation, dependency staging, and full docs #84

Merged: bmadcode merged 6 commits into main from eval-runner on May 10, 2026

Conversation

@bmadcode (Contributor)

Summary

Adds the bmad-eval-runner skill plus complete documentation. The runner evaluates a skill's behavior in an isolated workspace (Docker preferred, local fallback) and grades the result against eval-author expectations.

Skill (5 commits)

  • Initial skill: bmad-eval-runner with claude -p based execution, isolation strategies, and discovery
  • Credential staging + correct trigger detection: macOS Keychain credential staging into the sandbox; synthetic skill placed at .claude/skills/<unique>/SKILL.md so the Skill tool can actually fire
  • Setup overlay system: base (evals/setup/) and per-eval (evals/<id>/setup/) directories are rsynced into the workspace before the skill is staged, making dependency skills available inside the sandbox
  • Trigger fix: --dangerously-skip-permissions added to claude -p invocations so the Skill tool can read SKILL.md (fixes 0% trigger rate)
  • Per-eval timeout override: evals.json entries can set "timeout": N to override the runner's default
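
A minimal sketch of the override lookup (DEFAULT_TIMEOUT and the entry shape are illustrative assumptions, not the runner's actual code):

    import json

    DEFAULT_TIMEOUT = 600  # assumed runner default, in seconds

    # Hypothetical evals.json entry; "timeout" is the new per-eval override.
    entry = json.loads('{"id": "brief-long-form", "timeout": 1200}')

    # Fall back to the runner default when no override is present.
    timeout = entry.get("timeout", DEFAULT_TIMEOUT)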

Docs

  • explanation/what-are-evals.md: artifact vs trigger evals; output vs transcript grading; best practices; worked example pointing at bmad-product-brief
  • explanation/why-bmad-eval-runner.md: isolation, dependency staging, trigger detection, permanent artifacts
  • how-to/install-docker-for-evals.md: Docker Desktop setup with credential-safety notes
  • how-to/run-evals-against-a-skill.md: 5-step run flow with worked example
  • reference/eval-format.md: complete schema (fixtures, setup overlays, per-eval timeout)
  • _diagrams/eval-test-types.excalidraw + Playwright renderer (render.mjs + render.html)
  • public/img/eval-test-types.png: rendered architecture diagram

Test Plan

  • Eval runner executed end-to-end against bmad-product-brief: all 17 artifact evals ran, 16 passed, and the single timeout was traced to a too-tight per-eval limit and fixed with the new override field
  • Trigger detection verified: synthetic skill firing observed in stream-JSON
  • Setup overlay confirmed: dependency skills (bmad-distillator, editorial review skills) available inside the sandbox
  • Docs validated: zero em dashes, zero banned vocabulary, all cross-doc links land
  • Diagram renders cleanly via Playwright (docs/_diagrams/render.mjs)

bmadcode added 5 commits May 9, 2026 13:42
New skill for running a target skill's evals in a clean, isolated environment.
Supports both artifact evals (evals.json with expectations) and trigger evals
(triggers.json with should_trigger). Adapted from Anthropic's skill-creator
eval pipeline (run_eval.py, grader.md, generate_review.py).

Isolation strategy:
  - Docker preferred: each eval runs in a fresh bmad-eval-runner:latest
    container with HOME pointed at an empty in-container dir, no host
    CLAUDE.md or auto-memory bleed-through. Image built on first run.
  - Local fallback: ~/bmad-evals/<run-id>/<eval-id>/ with HOME override
    to a clean .home/ directory. Best-effort; user is told.
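
A minimal sketch of the local fallback's HOME override (paths and the prompt are illustrative; the real run_evals.py may differ):

    import os
    import subprocess
    from pathlib import Path

    workspace = Path.home() / "bmad-evals" / "run-001" / "eval-01"
    clean_home = workspace / ".home"
    clean_home.mkdir(parents=True, exist_ok=True)

    # HOME points at an empty directory so the subprocess cannot read the
    # host's ~/.claude/, CLAUDE.md, or auto-memory.
    env = dict(os.environ, HOME=str(clean_home))
    subprocess.run(["claude", "-p", "eval prompt here"], cwd=workspace, env=env)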

Artifacts (transcript, files Claude wrote, metrics, grading) are retained
permanently per run so users can review what happened, not just whether
it passed.

Layout:
  SKILL.md                       outcome-driven entry
  references/isolation.md        Docker + local strategies
  references/eval-formats.md     evals.json + triggers.json schemas
  scripts/run_evals.py           artifact runner
  scripts/run_triggers.py        trigger runner (adapted from Anthropic)
  scripts/docker_setup.py        Docker detection + image build
  scripts/generate_report.py     aggregate HTML report
  scripts/utils.py               shared helpers
  agents/grader.md               judge subagent
  assets/Dockerfile              clean Claude Code image

Three fixes from running the runner end-to-end against bmad-product-brief:

1. Stage Claude Code OAuth credentials into each isolated workspace.
   Both isolation modes override HOME, so the subprocess can't read the
   host's ~/.claude/ and the macOS Keychain ACL prevents it from reading
   the credential directly. The parent process (which owns the ACL) now
   reads "Claude Code-credentials" via `security find-generic-password`
   once at import, then writes it as .credentials.json into each
   workspace's .claude/ before launching claude -p. ANTHROPIC_API_KEY
   passthrough still works as a fallback for non-macOS hosts (see the
   sketch after item 3).

2. Trigger detection: place the synthetic skill at .claude/skills/<name>/
   SKILL.md instead of .claude/commands/<name>.md. Slash commands do not
   surface as Skill tool calls, which is why the previous implementation
   (matching Anthropic's reference run_eval.py) reported 0% trigger rates
   for every should-trigger query. Real skills under .claude/skills/ do
   fire the Skill tool, letting the existing detector observe genuine
   trigger events.

3. Docker credential mount: write to a dedicated <eval-dir>/creds/
   directory so the container mount holds exactly one file at the
   expected path (`/creds/.credentials.json`). Mounting eval-dir directly
   would expose all run output and require the container to know an
   undocumented dot-prefix filename.
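
A sketch of the staging flow from fixes 1 and 3 (function names and paths are illustrative, not the script's actual API):

    import subprocess
    from pathlib import Path

    def read_keychain_credentials() -> str:
        # The parent process owns the Keychain ACL, so it can read the item;
        # the real script does this once at import.
        result = subprocess.run(
            ["security", "find-generic-password",
             "-s", "Claude Code-credentials", "-w"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    def stage_credentials(claude_dir: Path, creds: str) -> None:
        claude_dir.mkdir(parents=True, exist_ok=True)
        (claude_dir / ".credentials.json").write_text(creds)

    # Fix 3: a dedicated creds/ directory holds exactly one file, so the
    # Docker mount exposes only /creds/.credentials.json, not all run output.
    # stage_credentials(eval_dir / "creds", read_keychain_credentials())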

isolation.md and SKILL.md updated to document the auth flow, the
local-mode trigger leak (host-installed skills can bleed in via cwd
discovery despite the HOME override; prefer Docker for trigger evals),
and why real-skill placement is correct vs. slash-command placement.

Multi-turn workflow handling for non-headless skills is still TODO.

- Setup overlay system: rsync evals/setup/ (base) and evals/<id>/setup/
  (per-eval) onto each workspace before skill staging, making dependency
  skills and _bmad/ config available inside the sandbox (sketched after
  this list)
- Add parse_skill_dependencies, discover_setup_dirs, apply_setup_overlay
  to utils.py; wire through run_evals.py for both local and Docker modes
- Fix 0% trigger rate: add --dangerously-skip-permissions to all claude -p
  invocations in run_triggers.py (without it Skill tool cannot read SKILL.md)
- Upgrade grader.md with richer transcript parsing guidance (tool-call
  patterns, phase ordering, read-only enforcement, JSON block extraction)
- Expand eval-formats.md reference with setup overlay and dependency docs
- Default workers bumped to 8
- Add pty_runner.py (experimental; not wired into main flow)
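
A rough shape of the overlay application (function names follow this commit's description of utils.py; the bodies are illustrative):

    import subprocess
    from pathlib import Path

    def discover_setup_dirs(evals_root: Path, eval_id: str) -> list[Path]:
        # Base overlay first, then the per-eval overlay so it can override.
        return [evals_root / "setup", evals_root / eval_id / "setup"]

    def apply_setup_overlay(setup_dirs: list[Path], workspace: Path) -> None:
        for src in setup_dirs:
            if not src.is_dir():
                continue
            # Trailing slash on src: copy its contents into the workspace root.
            # check=True surfaces rsync failures here; a review comment below
            # notes the shipped code uses check=False.
            subprocess.run(["rsync", "-a", f"{src}/", f"{workspace}/"], check=True)
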
…ad-eval-runner

- explanation/what-are-evals.md: artifact vs trigger evals; output vs transcript grading
- explanation/why-bmad-eval-runner.md: isolation, dependency staging, real triggers, permanent artifacts
- how-to/install-docker-for-evals.md: Docker Desktop setup with credential-safety notes
- how-to/run-evals-against-a-skill.md: 5-step run flow with brief eval suite as worked example
- reference/eval-format.md: complete schema for evals.json + triggers.json (fixtures, setup overlays, per-eval timeout)
- _diagrams/eval-test-types.excalidraw: source diagram with Playwright renderer (render.mjs + render.html)
- public/img/eval-test-types.png: rendered architecture diagram embedded in what-are-evals.md
- update explanation/index.md and reference/index.md sidebars

coderabbitai Bot commented May 10, 2026

Warning: rate limit exceeded. @bmadcode has exceeded the limit for the number of commits that can be reviewed per hour.

📥 Commits

Reviewing files that changed from the base of the PR and between 86033fc and 12effa6.

⛔ Files ignored due to path filters (1)
  • website/public/img/eval-test-types.png is excluded by !**/*.png
📒 Files selected for processing (22)
  • docs/_diagrams/README.md
  • docs/_diagrams/eval-test-types.excalidraw
  • docs/_diagrams/render.html
  • docs/_diagrams/render.mjs
  • docs/explanation/index.md
  • docs/explanation/what-are-evals.md
  • docs/explanation/why-bmad-eval-runner.md
  • docs/how-to/install-docker-for-evals.md
  • docs/how-to/run-evals-against-a-skill.md
  • docs/reference/eval-format.md
  • docs/reference/index.md
  • skills/bmad-eval-runner/SKILL.md
  • skills/bmad-eval-runner/agents/grader.md
  • skills/bmad-eval-runner/assets/Dockerfile
  • skills/bmad-eval-runner/references/eval-formats.md
  • skills/bmad-eval-runner/references/isolation.md
  • skills/bmad-eval-runner/scripts/docker_setup.py
  • skills/bmad-eval-runner/scripts/generate_report.py
  • skills/bmad-eval-runner/scripts/pty_runner.py
  • skills/bmad-eval-runner/scripts/run_evals.py
  • skills/bmad-eval-runner/scripts/run_triggers.py
  • skills/bmad-eval-runner/scripts/utils.py

augmentcode Bot commented May 10, 2026

🤖 Augment PR Summary

Summary: Adds the bmad-eval-runner skill to run a skill’s artifact and trigger eval suites in isolated workspaces and emit a permanent, inspectable run report.

Changes:

  • Introduced the bmad-eval-runner skill definition plus a dedicated grader subagent prompt.
  • Implemented Python runners for artifact evals (run_evals.py) and trigger evals (run_triggers.py) with Docker-preferred isolation and a local fallback.
  • Added Docker image management (docker_setup.py) and a minimal runner image (assets/Dockerfile).
  • Added an aggregate HTML report generator (generate_report.py) that combines execution results, per-eval grading, and trigger rates.
  • Added/updated docs covering eval concepts, why isolation matters, Docker installation, how to run evals, and the full eval schema.
  • Added diagram sources and a Playwright-based renderer to produce committed PNGs for the documentation.

Technical Notes: Workspaces are staged via project rsync + setup overlays + fixtures; runs capture stream-JSON transcripts and support per-eval timeout overrides.


bmadcode merged commit 72628e2 into main, May 10, 2026
4 checks passed

augmentcode Bot left a comment

Review completed. 5 suggestions posted.

Comment thread: docs/_diagrams/render.mjs (outdated)

const sceneJson = JSON.parse(readFileSync(inPath, "utf-8"));

const htmlPath = resolve(fileURLToPath(import.meta.url), "..", "excalidraw_render.html");

augmentcode Bot commented May 10, 2026

htmlPath points at excalidraw_render.html, but this PR adds docs/_diagrams/render.html, so the renderer won’t be able to page.goto() the HTML it expects and will fail/hang.

Severity: medium


workspace_snapshot_before = snapshot_files(workspace_project)

home_dir = workspace_root / ".home"
stage_credentials(home_dir / ".claude", _KEYCHAIN_CREDS)

augmentcode Bot commented May 10, 2026

Staging the macOS Keychain OAuth JSON into the per-eval run directory (workspace/.home/.claude/.credentials.json and eval_dir/creds/.credentials.json) appears to persist credentials in the “artifacts are forever” run folder, which is a significant secret-leak risk if runs are backed up or shared.

Severity: high

Other Locations
  • skills/bmad-eval-runner/scripts/run_evals.py:232
  • skills/bmad-eval-runner/scripts/run_triggers.py:148
  • skills/bmad-eval-runner/scripts/run_triggers.py:225
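
One possible mitigation, sketched (hypothetical; not code from this PR): delete the staged secret once the claude -p subprocess exits, so the permanent run folder never retains it.

    from pathlib import Path

    def scrub_staged_credentials(eval_dir: Path) -> None:
        # Paths follow the locations named in the finding above; the actual
        # layout may differ per isolation mode.
        for rel in ("workspace/.home/.claude/.credentials.json",
                    "creds/.credentials.json"):
            target = eval_dir / rel
            if target.exists():
                target.unlink()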


if e.stderr:
    stderr_tail += "\n" + e.stderr.decode("utf-8", errors="replace")[-2000:]

new_files = diff_workspace(workspace_project, workspace_snapshot_before)

augmentcode Bot commented May 10, 2026

Local artifact capture only includes newly-created paths (after - before), so edits to existing files (e.g., Update/Validate flows) won’t be reflected in artifacts/ for grading; in Docker mode the container script rsyncs the entire workspace (including the whole project), which can massively bloat runs and dilute what the skill actually produced.

Severity: medium

Other Locations
  • skills/bmad-eval-runner/scripts/run_evals.py:259
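
A sketch of a snapshot that would also catch edits (a hypothetical replacement for snapshot_files/diff_workspace that compares size and mtime instead of path presence alone):

    from pathlib import Path

    def snapshot_files(root: Path) -> dict[str, tuple[int, float]]:
        # Record (size, mtime) per file so later edits are detectable.
        return {
            str(p.relative_to(root)): (p.stat().st_size, p.stat().st_mtime)
            for p in root.rglob("*") if p.is_file()
        }

    def diff_workspace(root: Path, before: dict) -> list[str]:
        # New files plus files whose size or mtime changed.
        after = snapshot_files(root)
        return [rel for rel, sig in after.items() if before.get(rel) != sig]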


for src in setup_dirs:
    if not src.is_dir():
        continue
    subprocess.run(

augmentcode Bot commented May 10, 2026

apply_setup_overlay() shells out to rsync unconditionally and ignores failures (check=False), so on hosts without rsync (or if rsync errors) overlays can silently not apply and dependency staging can fail in hard-to-debug ways.

Severity: medium


    pending_tool = name
    accumulated_json = ""
else:
    return False, ""

augmentcode Bot commented May 10, 2026

parse_stream_for_trigger() returns False immediately when it sees a tool_use that isn’t Skill/Read (and also returns False after the first assistant event lacking the tool), which can create false negatives if the synthetic skill fires later in the stream.

Severity: medium
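
A sketch of a whole-stream scan that avoids the early return (hypothetical; the event shape assumes the stream-JSON format the runner already parses):

    import json

    def stream_fired_skill(transcript_path: str) -> bool:
        # Scan every event instead of bailing on the first non-Skill tool_use,
        # so a synthetic skill that fires late is not a false negative.
        with open(transcript_path) as fh:
            for line in fh:
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue
                content = (event.get("message") or {}).get("content") or []
                for block in content:
                    if (isinstance(block, dict)
                            and block.get("type") == "tool_use"
                            and block.get("name") == "Skill"):
                        return True
        return False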


bmadcode deleted the eval-runner branch, May 10, 2026 00:04